Results from A1

We will nest our A1 document in this notebook to continue our analysis.

Dataset Information

Publication Title: Transcriptomic analyses reveal rhythmic and CLOCK-driven pathways in human skeletal muscle

Publication Date: 2018-04-16

Publication Journal: eLife

GEO ID: GSE108539

The dataset has intronic and extronic regions for each sample. The groups were defined as per the following definitions [1]:
(1) Exonic: if it occurs in at least one of the transcripts
(2) Intronic: if it is shared between all the transcripts
Also, genes with less than two intronic reads or 10 exonic reads on average were discarded.

Setup and package installation:

Downloading dataset and supplementary files

Information about Platform

Platform Title: Illumina HiSeq 2500 (Homo sapiens)
Original submission date: Mar 14 2013
Last update date: Mar 27 2019
Organism: Homo sapiens
No. of GEO datasets that use this technology: 6080
No. of GEO samples that use this technology: 167630

Here we are creating a table to show the groups in our sampling.

We have 16093 genes expressed in the dataset. These genes are based on 114 samples from 10 study participants. Each participant was sampled at 6 times in 4 hour intervals.

Cleaning, Filtering and Normalizing

The available count data on GEO was already cleaned, normalized and filtered as described in [1]. The authors of the dataset performed the following pre-processing procedure:
1. Transcripts with lower than 3 CPM were removed
2. Transcripts that were not aligned were removed
3. Scaling factors were determined using the Trimmed M-values method [2] was applied

We will first check if any genes are duplicated based on the ensembl ids provided:

We find the no duplicates, due to the fact that the data has already been cleaned.

Both box plots show us a very well normalized set of data. No outliers can be seen from this. A key difference to note here is the difference in medians between intron and exon sequences.

Again we see a similar trend as we did in the box plot. Intron sequences (green lines) are distributed evenly but are quite distinct from exon sequences (blue).

From the MDS plot, overall we see the clustering of majority of intron and exon samples. The more interesting pattern is found when observed the samples that are not clustered with the others; we see the y-axis values are similar introns and exon samples with the same participants and time. Hence, there does not seem to be any significant outliers, and we will continue downstream analysis with all samples.

Next, we will show common vs tagwise dispersion. Dispersion measures the how much our counts deviate from the mean.

This is not an expected or normal dispersion plot. Typically, the variance should decrease with increase in log CPM, but we have the opposite here. Investigation (reading paper, googling the issue, etc) to correct this was conducted but to no avail. Further investigation will be conducted to understand this.

Next, we will plot the mean and variance relationship.

In this plot, the blue line shows the Negative Binomial. The black line shows the Poisson mean-variance relationship. The best fit of the data is marked by the red crosses. The data seems to have been normalized well and there is no overdispersion.

Identifier Mapping

Here we will get information on the identifier datasets and map the ensembl IDs to the current HUGO symbols:

Here we see that there is ensembl_gene_id, exactly what we are looking for. We will use this to map the ensembl ids in our dataset.

Now that we have mapped the genes to HGNC symbols, we will check if there are any ensembl ids that have duplicate HGNC symbols:

Hence, we have 3 genes that have duplicate mapping. Since this is a very small number relative to our overall database, we will not filter these out just yet.

The difference between number of mappings we found and number of genes in the normalized dataset is 148

This number however does not tell the fully story. We will merge the mappings with the normalized dataset to get a better picture of the mapping.

Based on our merged ensembl ids to HGNC symbols, we will identify the number of genes that did not match:

148 identifiers are missing.
148 out of 16093 have not been mapped. This accounts 0.92% of the genes that were not identifed. To address the unidentified genes, we will attempt to use a new tool called Biobtree [3]. This is an alternative tool to biomarRt, using a different dataset, hence we may be able to map some more unidentified genes.

Questions

What are the control and test conditions of the dataset?
The control condition was skeletal muscle biopsies collected from human samples who were placed under rigid laboratory conditions. Biopsies were taken every 4 hours across a 24 hour period from 10 individuals. The test conditions were performed on cultured human primary skeletal myotubes (hSKM). Though analyzing the cultered data analysis is beyond the scope of the course, there are several groups that the dataset can be clustered into (i.e. by demographic of timepoint) for differential analysis.

Why is the dataset of interest to you?
I have been engaged in sleep research for the past few years at the level of physiological signals. Given my knowledge and previous experience in the sleep domain, exploring a dataset with biological signals related to sleep and circadian rhythms seemed quite fitting.

Were there expression values that were not unique for specific genes? How did you handle these?   There were 3 values were each mapped to 2 genes. This is an insignificant amount of replication so we will simply filter both replications of these genes out in downstream analysis.

Were there expression values that could not be mapped to current HUGO symbols?
Yes there were 148 expression values that could not be mapped to current HUGO symbols.

How many outliers were removed?
No outliers were removed. There are some potential outliers shown in the MDS plot, but seem consistent enough to consider. We will keep all samples for now as they may reveal important insight in downstream analysis.

How did you handle replicates?
There were no replicates in the dataset provided as it had already been cleaned prior to uploading to GEO.

What is the final coverage of your dataset?
(1) The dataset provided 16093 genes.
(2) However, the dataset provided was already pre-filtered according to the conditions described in the Filtering and Normalizing section.
(3) We were able to map 15945 ensembl ids to current hgnc symbols.
(4) We then identified 3 genes that were replicated, which will be filtered out downstream.   Overall, we are left with 1.59410^{4} genes to work with, accounting for 99.0492761% of the genes in the dataset. Not too bad :)

References

[1]L. Perrin et al., “Transcriptomic analyses reveal rhythmic and CLOCK-driven pathways in human skeletal muscle,” eLife, vol. 7, p. e34114, Apr. 2018, doi: 10.7554/eLife.34114.

[2]M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data,” Bioinformatics, vol. 26, no. 1, pp. 139–140, Jan. 2010, doi: 10.1093/bioinformatics/btp616.

A2 starts here

Data Overview:

The dataset being used in this analysis is based on what is published in [1]. The dataset samples 10 participants at 6 different timepoints including during the day and during the night (while they were asleep) to get an idea of the key genes involved in circ #### Analysis Objectives:

First we will install the necessary packages:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
   install.packages(pkgs = c("BiocManager"),
           repos = "http://cran.rstudio.org",
           dependencies = TRUE,
           quiet = TRUE) 
}
if (!requireNamespace("circlize", quietly = TRUE)) {
   install.packages(pkgs = c("circlize"),
           repos = "http://cran.rstudio.org",
           dependencies = TRUE,
           quiet = TRUE) 
}

if (!requireNamespace("gprofiler2", quietly = TRUE)) {
   install.packages(pkgs = c("circlize"),
           repos = "http://cran.rstudio.org",
           dependencies = TRUE,
           quiet = TRUE) 
}

BiocManager::install("ComplexHeatmap")
BiocManager::install("limma")
BiocManager::install("GOstats")

library(Biobase)
library(circlize)
library(limma)
library(grid)
library(gprofiler2)

normalized_count_data <- normalized_counts_annot_filtered

First we will setup or data variables:

intron_cols <- grep("Intron.*|ensembl_gene_id|hgnc_symbol*", names(normalized_count_data), value = TRUE)
exon_cols <- grep("Exon*|ensembl_gene_id|hgnc_symbol", names(normalized_count_data), value = TRUE)
normalized_count_data_introns <- normalized_count_data[ , intron_cols]
normalized_count_data_exons <- normalized_count_data[ , exon_cols]

heatmap_matrix <- normalized_count_data[ ,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$ensemble_gene_id
colnames <- colnames(normalized_count_data[ ,3:ncol(normalized_count_data)])


heatmap_matrix_intron <- normalized_count_data_introns[ ,3:ncol(normalized_count_data_introns)]
rownames(heatmap_matrix_intron) <- normalized_count_data_introns$ensemble_gene_id
colnames <- colnames(normalized_count_data_introns[ ,3:ncol(normalized_count_data_introns)])

heatmap_matrix_exon <- normalized_count_data_exons[ ,3:ncol(normalized_count_data_exons)]
rownames(heatmap_matrix_intron) <- normalized_count_data_exons$ensemble_gene_id
colnames <- colnames(normalized_count_data_exons[ ,3:ncol(normalized_count_data_exons)])

Note: Code for Heatmap matrix for unthresholded dataset is commented out and hidden as it takes too long to execture.

Here we are preparing a matrix to test for differences in our key gene of interest (CLOCK) gene, between our experiment groups.

gene_of_interest <- which(normalized_count_data_exons$hgnc_symbol == "CLOCK")
participants <- c("Exon_15", "Exon_3", "Exon_2","Exon_6", "Exon_7", "Exon_9", "Exon_13", "Exon_4", "Exon_12", "Exon_1")
timepoint_clock_matrix <- matrix(nrow=10, ncol=0)
row.names(timepoint_clock_matrix) <- participants
create_timepoint_matrix <- function(grep_pattern, timepoint) {
  tp <- grep(colnames(normalized_count_data_exons),
                          pattern=grep_pattern)
  m <- (t(normalized_count_data_exons
                       [gene_of_interest, tp]))
  temp_rownames <- strsplit(rownames(m), " ")
  new_rownames <- lapply(temp_rownames, function(x) {
    new_name <- unlist(strsplit(x, "_"))[1:2]
    new_name <- paste(new_name[1], new_name[2], sep="_")
    return(new_name)
  })
  colnames(m) <- c(timepoint)
  row.names(m) <- new_rownames
  return(m)
}
t1 <- create_timepoint_matrix("Exon.*_12.00", "12.00")
t2 <- create_timepoint_matrix("Exon.*_16.00", "16.00")
t3 <- create_timepoint_matrix("Exon.*_20.00", "20.00")
t4 <- create_timepoint_matrix("Exon.*_00.00", "00.00")
t5 <- create_timepoint_matrix("Exon.*_04.00", "04.00")
t6 <- create_timepoint_matrix("Exon.*_08.00", "08.00")
matricies <- list(t1, t2, t3, t4, t5, t6)
for (l in matricies) {
  timepoint_clock_matrix <- cbind(timepoint_clock_matrix, l[, 1][match(rownames(timepoint_clock_matrix), rownames(l))])
}
colnames(timepoint_clock_matrix) <- c("12.00", "16.00", "20.00", "00.00", "04.00", "08.00")
timepoint_clock_matrix
##            12.00    16.00    20.00     00.00    04.00     08.00
## Exon_15 2.018295 2.359913 2.446082 2.5795096 2.462013 0.7254062
## Exon_3  1.668517 1.909955 2.125634 1.9915728 2.019576 0.7396265
## Exon_2  1.920443 1.972585 2.356818 2.2837716 2.296801 2.0533181
## Exon_6  1.612411 1.659358 1.951415 2.0688471 1.955065 1.8176283
## Exon_7  1.798982 1.639746 2.054475        NA       NA        NA
## Exon_9  1.706860 1.810690 2.045504 2.2416810 1.988155 1.6351701
## Exon_13 1.796529 1.769102 1.968643 2.0990352 2.106858 1.8899544
## Exon_4  1.878297 2.103864 2.220510 2.4258696 2.149684 1.8286810
## Exon_12 1.537008 1.638882 1.976554 0.1883507 2.360112 1.7235454
## Exon_1  1.630994 1.613856 1.644382 1.4006078 1.610527 1.2192292

To compare the CLOCK gene across our timepoint groups, we will perform an ANOVA (rather than a t-test) as we have more than 2 groups in the data. We will need to transform the matrix created in the previous block to use these data in the ANOVA function.

df_clock_timepoint <- data.frame()
for (col in 1:ncol(timepoint_clock_matrix)) {
  values <- as.data.frame(timepoint_clock_matrix[col, ])
  timepoints <- rep(c(colnames(timepoint_clock_matrix)[col]), 6)
  values <- cbind(values, timepoints)
  colnames(values) <- c("values", "timepoints")
  rownames(values) <- NULL
  df_clock_timepoint <- rbind(df_clock_timepoint, values)
}
df_clock_timepoint[1:5, ]
aov_clock <- aov(df_clock_timepoint$values ~ df_clock_timepoint$timepoints)
summary(aov_clock)
##                               Df Sum Sq Mean Sq F value Pr(>F)
## df_clock_timepoint$timepoints  5  0.741  0.1481   0.897  0.497
## Residuals                     27  4.456  0.1651               
## 3 observations deleted due to missingness

Our P value is >0.05, hence we equal variance between the timepoints.

Here we revisit the MDS plot developed in A1, but using the limma package. Again, we notice For the most part, samples in the intron and exon class cluster together with some outliers away from the 2 main clustering sites.

pat_colors <- rainbow(10)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
limma::plotMDS(heatmap_matrix_exon,
               col = pat_colors)

Typically, our samples cluster together, though we have 4 outliers to the right of the graph.

In the study related to the dataset, they are looking at which genes are expressed at the 6 different timepoints to get a sense of the genes involved in the human circadian rhythm.

Here, we will fit our data from each timepoint to a linear model. To do this, we will first create our experiment model design and will segregate our patient samples into groups by patient and timepoint.

samples <- data.frame(
        lapply(colnames(normalized_count_data_exons)[3:ncol(normalized_count_data_exons)], 
        FUN=function(x){
          unlist(strsplit(x, split = "_"))[c(2,4)]}))
colnames(samples) <- colnames(normalized_count_data_exons)[3:ncol(normalized_count_data_exons)]
rownames(samples) <- c("patient","timepoint")
samples <- data.frame(t(samples))
samples[1:5, ]

We will now create our model design. We have 6 different timepoints where data was collected, and so will we design the model to account for this.

model_design <- model.matrix(~ samples$timepoint)
model_design[1:5, 1:6]
##   (Intercept) samples$timepoint04.00 samples$timepoint08.00
## 1           1                      0                      0
## 2           1                      0                      0
## 3           1                      0                      0
## 4           1                      0                      0
## 5           1                      1                      0
##   samples$timepoint12.00 samples$timepoint16.00 samples$timepoint20.00
## 1                      1                      0                      0
## 2                      0                      1                      0
## 3                      0                      0                      1
## 4                      0                      0                      0
## 5                      0                      0                      0

Now we will our create our data matrix to apply the model to and get the P-values for our genes.

expressionMatrix <- as.matrix(normalized_count_data_exons[,3:ncol(normalized_count_data_exons)])
rownames(expressionMatrix) <- normalized_count_data_exons$ensembl_gene_id
colnames(expressionMatrix) <- colnames(normalized_count_data_exons)[3:ncol(normalized_count_data_exons)]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)
fit2 <- eBayes(fit, trend=TRUE)
topfit <- topTable(fit2, 
                   coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))

#merge hgnc symboles to the topfit table
output_hits <- merge(normalized_count_data_exons[,1:2],
                     topfit,
                     by.y=0,by.x=1,
                     all.y=TRUE)
#sort the output lists by decreasing pvaluee
output_hits <- output_hits[order(output_hits$P.Value),]

output_hits[1:5, ]
genes_pass <- length(which(output_hits$P.Value < 0.05))
genes_pass_correction <- length(which(output_hits$adj.P.Val < 0.05))

We have 1096 genes that pass the p-value threshold of <0.05. Further, we have 1 genes that pass the corrected P value threshold.

We have a sizeable amount of genes that pass the p-value threshold of 0.05, so we will represent these genes that passed through an MA plot and a heatmap.

xlim <- c(-3,1)
ylim <- c(-4,4)
passed_genes <- topfit[which(topfit$P.Value<0.05), ]
not_passed_genes <- topfit[which(topfit$P.Value>=0.05), ]
limma::plotMA(topfit, main="MA Plot: Passed Threshold Genes", ylim=ylim, xlim=xlim)

limma::plotMA(not_passed_genes, main="MA Plot: Other Genes", ylim=ylim, xlim=xlim)

As we can see the difference in the 2 plots, the p-value of 0.05 filters out a significant number of outliers.

We will now create a heatmap based on the thresholded gene list.

heatmap_matrix_new <- normalized_count_data_exons[,3:ncol(normalized_count_data_exons)]
rownames(heatmap_matrix_new) <- normalized_count_data_exons$ensembl_gene_id
colnames(heatmap_matrix_new) <- colnames(normalized_count_data_exons[,3:ncol(normalized_count_data_exons)])
heatmap_matrix_new <- t(scale(t(heatmap_matrix_new)))

top_hits <- output_hits$ensembl_gene_id[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix_new[
    which(rownames(heatmap_matrix_new) %in% top_hits),])))


if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = circlize::colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
  } else {
    heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
  }
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                               show_row_dend = TRUE,
                               show_column_dend = FALSE, 
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,
                           column_title =  "Heatmap of top gene hits - Limma Model",
                                                       column_names_gp = gpar(fontsize = 8)

                               )
current_heatmap

We see significant more expression on the left and very right ends of this heatp. Looking at these time points, they are typically 0:00, 4:00, and 8:00, times that these participants were likely asleep.

Now we will continue our differential analysis to see differential expression using a different model, the Quasi liklihood model.

First we will get our filtered count data from A1 and create our Quasi liklihood model and show the head of the model.

exon_data_matrix <- inverse_data_matrix[ , grepl("Exon.*", colnames(inverse_data_matrix))]
d = DGEList(counts=exon_data_matrix, group=samples$timepoint)
d <- estimateDisp(d, model_design)
fit <- glmQLFit(d, model_design)
model_design[1:10, 1:5]
##    (Intercept) samples$timepoint04.00 samples$timepoint08.00
## 1            1                      0                      0
## 2            1                      0                      0
## 3            1                      0                      0
## 4            1                      0                      0
## 5            1                      1                      0
## 6            1                      0                      1
## 7            1                      0                      0
## 8            1                      0                      0
## 9            1                      0                      0
## 10           1                      0                      0
##    samples$timepoint12.00 samples$timepoint16.00
## 1                       1                      0
## 2                       0                      1
## 3                       0                      0
## 4                       0                      0
## 5                       0                      0
## 6                       0                      0
## 7                       1                      0
## 8                       0                      1
## 9                       0                      0
## 10                      0                      0
qlf.pos_vs_neg <- glmQLFTest(fit)
topTags(qlf.pos_vs_neg)
## Coefficient:  samples$timepoint20.00 
##                     logFC   logCPM        F       PValue         FDR
## ENSG00000169429 -7.908888 6.134667 39.07578 6.733919e-08 0.001083690
## ENSG00000119508 -4.092447 4.686381 36.61462 1.413032e-07 0.001136996
## ENSG00000125740 -5.695278 4.785103 34.88074 2.412801e-07 0.001294307
## ENSG00000158050 -3.994121 4.387527 32.76261 4.709932e-07 0.001727587
## ENSG00000120738 -4.676144 6.104781 32.35508 5.367511e-07 0.001727587
## ENSG00000122877 -4.652983 4.130170 31.52152 7.026935e-07 0.001884741
## ENSG00000271762 -3.912207 5.534392 30.56867 9.594209e-07 0.002000830
## ENSG00000144655 -3.281010 5.589754 30.37537 1.022459e-06 0.002000830
## ENSG00000248323 -4.051097 5.273990 30.10221 1.118963e-06 0.002000830
## ENSG00000081041 -3.327333 4.934630 29.46007 1.384968e-06 0.002228829

Next, we will look at how many genes pass the thresholds with the Quasi.

qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
                           n = nrow(normalized_count_data))
quasi_pvalue_pass <- length(which(qlf_output_hits$table$PValue < 0.05))
quasi_corrected_pass <- length(which(qlf_output_hits$table$FDR < 0.05))

With the Quasi-liklihood model, 939 genes pass the threshold p-value <0.05 and 297 pass the correction.

Next we will compare the different models we generated (Limma vs Quasi liklihood).

qlf_pat_model_pvalues <- data.frame(
          ensembl_id = rownames(qlf_output_hits$table),
          qlf_patient_pvalue=qlf_output_hits$table$PValue)
limma_pat_model_pvalues <-  data.frame(
          ensembl_id = output_hits$ensembl_gene_id,
          limma_patient_pvalue = output_hits$P.Value)
two_models_pvalues <- merge(qlf_pat_model_pvalues,
                            limma_pat_model_pvalues,
                            by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$limma_patient_pvalue<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue<0.05 & two_models_pvalues$limma_patient_pvalue<0.05] <- "red"
plot(two_models_pvalues$qlf_patient_pvalue,
     two_models_pvalues$limma_patient_pvalue,
     col = two_models_pvalues$colour,
     xlab = "QLF patient model p-values",
     ylab ="Limma Patient model p-values",
     main="QLF vs Limma")

As we can see, there is a small overlap between the two models, shown in red. Blue points represent genes overexpressed only in the Quasi model and yellow represent the genes overpressed only in the limma model. We will highlight some genes of interest, known to contribute to circadian rythym as described in [1].

ensembl_of_interest <- normalized_count_data_exons$ensembl_gene_id[
  which(normalized_count_data_exons$hgnc_symbol == "CLOCK")]
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$ensembl_id==ensembl_of_interest] <- "red"
plot(two_models_pvalues$qlf_patient_pvalue,
     two_models_pvalues$limma_patient_pvalue,
     col = two_models_pvalues$colour,
     xlab = "QLF patient model p-values",
     ylab ="Limma Patient model p-values",
     main="QLF vs Limma")
points(two_models_pvalues[
  two_models_pvalues$ensembl_id==ensembl_of_interest,2:3],
       pch=24,  col="red", cex=1.5)

The red triangles are our genes of interest. We can see that our genes of interest strong correlation of expression in both models.

Now we will generate a heatmap for the Quasi Likelihood model.

top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix_new[which(rownames(heatmap_matrix_new) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = circlize::colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
  } else {
    heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
  }
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                               show_row_dend = TRUE,
                               show_column_dend = TRUE, 
                               col=heatmap_col,
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,
                               column_title =  "Heatmap of top gene hits - Quasi liklihood Model", 
                            column_names_gp = gpar(fontsize = 8)
)
current_heatmap

Compared to the limma heatmap, we see signficantly less differentiation here. However, the pattern of more differentation at timepoints 00:00 and 04:00 (times where participants are likely asleep) remains consistent between the two models.

Thresholded List

qlf_output_hits_withgn <- merge(normalized_count_data[,1:2],qlf_output_hits, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]

upregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
  which(qlf_output_hits_withgn$PValue < 0.45
             & qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
  which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC < 0)]

n_upreg_genes <- length(upregulated_genes)
n_downreg_genes <- length(downregulated_genes)

We have identified 1590 upregulated genes and 925 downregulated genes to be used in our gene enrichment analysis.

Enrichment Analysis

We will use the gprofiler package to a do a preliminary gene set analysis. We will perform a query with our upregulated genes and then our downregulated genes.

gostres_upreg <- gprofiler2::gost(upregulated_genes, organism = "hsapiens", ordered_query = FALSE,
  multi_query = FALSE, significant = TRUE, exclude_iea = TRUE,
  measure_underrepresentation = FALSE, evcodes = FALSE,
  user_threshold = 0.05, correction_method = c("g_SCS", "bonferroni",
  "fdr", "false_discovery_rate", "gSCS", "analytical"),
  domain_scope = c("annotated"), custom_bg = NULL,
  numeric_ns = "", sources = NULL)

gostres_downreg <- gprofiler2::gost(downregulated_genes, organism = "hsapiens", ordered_query = FALSE,
  multi_query = FALSE, significant = TRUE, exclude_iea = TRUE,
  measure_underrepresentation = FALSE, evcodes = FALSE,
  user_threshold = 0.05, correction_method = c("g_SCS", "bonferroni",
  "fdr", "false_discovery_rate", "gSCS", "analytical"),
  domain_scope = c("annotated"), custom_bg = NULL,
  numeric_ns = "", sources = NULL)

Manhattan Plot: Upregulated Genes

gprofiler2::gostplot(gostres_upreg, capped = TRUE, interactive = TRUE, pal = c(`GO:MF`
  = "#dc3912", `GO:BP` = "#ff9900", `GO:CC` = "#109618", KEGG = "#dd4477",
  REAC = "#3366cc", WP = "#0099c6", TF = "#5574a6", MIRNA = "#22aa99", HPA
  = "#6633cc", CORUM = "#66aa00", HP = "#990099"))

Manhattan Plot: Downregulated Genes

gprofiler2::gostplot(gostres_downreg, capped = TRUE, interactive = TRUE, pal = c(`GO:MF`
  = "#dc3912", `GO:BP` = "#ff9900", `GO:CC` = "#109618", KEGG = "#dd4477",
  REAC = "#3366cc", WP = "#0099c6", TF = "#5574a6", MIRNA = "#22aa99", HPA
  = "#6633cc", CORUM = "#66aa00", HP = "#990099"))

These two plots show us the gene sets that our upregulated and downregulated genes belong to. There are 11 gene sets for both the upregulated and downregulated genes. There is noticable different in the number of genes for upregulated vs. downregulated. For example, HPA and CORUM sets have significantly more genes in upregulated.

Questions - Differential Gene Expression

  1. How many genes were significantly differentially expressed? What thresholds did you use and why?
    • For the Limma model, we have 1096 genes that pass the p-value threshold of <0.05. This is a standard threshold used.
  2. Which method did you use? And Why?
    • The Benjamini-Hochberg method was used. This method seemed to have modeled our data well, passing a sizeable number of genes in both p-value threshold. and corrected threshold.
  3. How many genes passed correction? Further, we have 1 genes that pass the corrected P value threshold.
    • With the Quasi-liklihood model, 939 genes pass the threshold p-value <0.05 and 297 pass the correction. We have a sizeable amount of genes that pass the p-value threshold of 0.05, so we will represent these genes that passed through an MA plot and a heatmap.
  4. Do you conditions cluster together? Explain why or why not.
    • Typically we see a clustering of timepoints 0:00, 4:00 and 8:00, timings when our participants conditions are likely asleep.

Questions - Thresholded Over Representation Analysis

  1. Which method did you choose and why?
    • GProfiler was used because it is compatible with the hgnc symbols of the dataset being used. In addition, it was a fairly recent and updated tool (latest publication from 2019 [2]). It also has a R library and web server visualization tools (both on Web Server and through R), making it easy to navigate.
  2. What annotation data did you use and why? What version of the annotation are you using?
    • gProfiler provides serveral annonation sources and they were all used for experimentation purposes (GO:MF, GO:CC, GO:BP, KEGG, REAC, TF, MIRNA, HPA, CORUM, HP, WP). GO-BP returend the most number of sets and the version of this was releases/2019-07-01
  3. How many genesets were returned with what thresholds?
    • For upregulated genes, the domain size of GO:BP was 18842.
    • For downregualted genes, the domain size of GO:BP was 17847.
    • Note threshold was changed to 0.45 for upregulated to balance the number of genesets for upregulated vs downregulated
  4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
    • Answered above (under Manhattan Plot: Downregulated Genes portion).

References

[1] L. Perrin et al., “Transcriptomic analyses reveal rhythmic and CLOCK-driven pathways in human skeletal muscle,” eLife, vol. 7, p. e34114, Apr. 2018, doi: 10.7554/eLife.34114.

[2] U. Raudvere et al., “g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update),” Nucleic Acids Research, vol. 47, no. W1, pp. W191–W198, Jul. 2019, doi: 10.1093/nar/gkz369.